AWARDS ABSTRACT


A neural network is trained to output a time dependent
target vector defined over a predetermined time interval in
response to a time dependent input vector defined over the
same time interval by applying corresponding elements of
the error vector, or difference between the target vector
and the actual neuron output vector, to the inputs of
corresponding output neurons of the network as corrective
feedback. This feedback decreases the error and quickens
the learning process, so that a much smaller number of
training cycles is required to complete the learning
process. A conventional gradient descent algorithm is
employed to update the neural network parameters at the end
of the predetermined time interval. The foregoing process
is repeated in repetitive cycles until the actual output
vector corresponds to the target vector. In the preferred
embodiment, as the overall error of the neural network
output decreases during successive training cycles, the
portion of the error fed back to the output neurons is
decreased accordingly, allowing the network to learn with
greater freedom from teacher forcing as the network
parameters converge to their optimum values. The invention
may also be used to train a neural network with stationary
training and target vectors.
FAST TEMPORAL NEURAL LEARNING USING
TEACHER FORCING 

BACKGROUND OF THE INVENTION 
Origin of the Invention: 

The invention described herein was made in the performance of work 
under a NASA contract, and is subject to the provisions of Public Law 
96-517 (35 USC 202) in which the contractor has elected not to retain 
title. 


Technical Field: 

The invention relates to training neural networks with time depen- 
dent phenomena and to the problems associated therewith, including 
reducing the number of computations required and increasing the qual- 
ity or fidelity of the neural network output. 

Background Art: 

Recently, there has been a tremendous interest in developing learn- 
ing algorithms capable of modeling time-dependent phenomena. In par- 
ticular, considerable attention has been devoted to capturing the dynam- 
ics embedded in observed temporal sequences. 


In general, the neural architectures under consideration may be clas- 
sified into two categories: 

* Feedforward networks, in which back propagation through time can
be implemented. This architecture has been extensively analysed
and is widely used in simple applications due, in particular, to the
straightforward nature of its formalism.

* Recurrent networks, also referred to as feedback or fully connected
networks, which are currently receiving increased attention. A key
advantage of recurrent networks lies in their ability to use information
about past events for current computations. Thus, they can
provide time-dependent outputs for both time-dependent as well as
time-independent inputs.


One may argue that, for many real world applications, the feedfor- 
ward networks suffice. Furthermore, a recurrent network can, in prin- 
ciple, be unfolded into a multilayer feedforward network. A detailed 
analysis of the merits and demerits of these two architectures is beyond 
the scope of this specification. Here, we will focus only on recurrent 
networks. 

The problem of temporal learning can typically be formulated as 
a minimization, over an arbitrary but finite time interval, of an appro- 
priate error functional. The gradients of the functional with respect to 
the various parameters of the neural architecture, e.g., synaptic weights, 
neural gains, etc. are essential elements of the minimization process and, 
in the past, major efforts have been devoted to the efficacy of their com- 
putation. Calculating the gradients of a system's output with respect to 
different parameters of the system is, in general, of relevance to several 
disciplines. Hence, a variety of methods have been proposed in the liter- 
ature for computing such gradients. A recent survey of techniques which 
have been considered specifically for temporal learning can be found in 
Pearlmutter, B.A. (1990) "Dynamic recurrent neural networks,” Tech- 
nical Report CMU-CS-90-196, School of Computer Science, Carnegie 
Mellon University, Pittsburgh, PA. We will briefly mention only those 
which are relevant to the present invention. 

Sato proposed, at the conceptual level, an algorithm based upon 
Lagrange multipliers. However, his algorithm has not yet been validated 
by numerical simulations, nor has its computational complexity been 
analyzed. Williams and Zipser [Williams, R.J., and Zipser, D. (1989)
"A learning algorithm for continually running fully recurrent neural
networks", Neural Computation, Vol. 1, No. 2, pp. 270-280] presented a
scheme in which the gradients of an error functional with respect to
network parameters are calculated by direct differentiation of the neural
activation dynamics. This approach is computationally very expensive
and scales poorly to large systems. The inherent advantage of the scheme
is the small storage capacity required, which scales as $O(N^3)$, where $N$
denotes the size of the network.


Pearlmutter, on the other hand, described a variational method 
which yields a set of linear ordinary differential equations for backpropa- 
gating the error through the system. These equations, however, need to 
be solved backwards in time, and require temporal storage of variables 
from the network activation dynamics, thereby reducing the attractiveness
of the algorithm. Recently, the inventors herein [Toomarian, N.
and Barhen, J. (1991) "Adjoint operators and non-adiabatic algorithms
in neural networks," Applied Mathematical Letters, Vol. 4, No. 2, pp.
69-73] suggested a framework which enables the error propagation
system of equations to be solved forward in time, concomitantly
with the neural activation dynamics. A drawback of this novel approach 
came from the fact that their equations had to be analyzed in terms 
of distributions, which precluded straightforward numerical implemen- 
tation. Finally, Pineda proposed combining the existence of disparate 
time scales with a heuristic gradient computation. The underlying adia- 
batic assumptions and highly approximate gradient evaluation technique, 
however, placed severe limits on the applicability of his method. 

Analogy to real-life behavior motivates the learning paradigm of the 
present invention described below. Suppose that a parent wants to teach 
his child to ride a bicycle. Clearly, the parent will not stay home, let his
child ride the bicycle and, from time to time, tell him how well or badly he
is performing (just as it happens in classical supervised learning). The
best way to train the child would be for the parent to accompany him 
during the riding sessions. This suggests that different dynamical sys- 
tems should be considered for the two basic stages of learning and recall 
(or generalization). However, the functional form of the neural dynamics 
used during the learning stage should smoothly evolve toward the func- 
tional form of the neural dynamics to be used during recall, after training 
is completed. In this context, the network dynamics during the learn- 
ing stage should include an instantaneous signal from the teacher on its 
performance. This necessitates a mechanism for incorporating informa- 
tion regarding the desired output directly into the activation dynamics. 



Such a mechanism has been referred to as teacher forcing. Williams and 
Zipser [Williams, R.J., and Zipser, D. (1988) “A learning algorithm for 
continually running fully recurrent neural networks,” Technical Report 
ICS Report 8805, UCSD, La Jolla, CA 92093], to the best of our knowledge,
have been the primary users of teacher forcing. They limited their
algorithm to a discrete-time problem, replacing the output of the
network with desired output values at each time step.

SUMMARY OF THE INVENTION 

The present invention is a new continuous form of teacher forcing
that appropriately modifies the activation dynamics of a simple additive
neural network during its learning stage. The temporal modulation of 
teacher forcing is analyzed as learning proceeds, so that the activation 
dynamics of the learning stage can actually be reduced to the activation 
dynamics of the recall stage. 

In accordance with the invention, a neural network is trained to
output a time dependent target vector defined over a predetermined time
interval in response to a time dependent input vector defined over the
same time interval by applying corresponding elements of the error
vector, or difference between the target vector and the actual neuron output
vector, to the inputs of corresponding output neurons of the network as
corrective feedback. This feedback decreases the error and quickens the
learning process, so that a much smaller number of training cycles is
required to complete the learning process. The learning process employs
a conventional gradient descent algorithm to update the neural network
parameters (e.g., synapse weights and/or neuron gains) at the end of the
time interval. The foregoing process is repeated in repetitive cycles until
the actual output vector corresponds to the target vector. It has been
found that not only is the number of required training cycles decreased
but that the quality or fidelity of the neural network output is
significantly increased by the invention. In the preferred embodiment, as the
overall error of the neural network output decreases during successive
training cycles, the portion of the error fed back to the output neurons
is decreased accordingly, allowing the network to learn with greater
freedom from teacher forcing as the network parameters converge to their
optimum values.

BRIEF DESCRIPTION OF THE DRAWINGS 

Fig. 1a is a diagram of a neural network of the prior art.

Fig. 1b is a diagram of a neural network training architecture of the
prior art.

Fig. 2 is a diagram of a neural network numerical training architecture
of the prior art including error feedback which nulls the error at
each numerical step.

Fig. 3 is a time domain diagram illustrating the behavior of the
neural network training architecture of Fig. 2.

Fig. 4 is a simplified diagram of a neural network training architecture
embodying the present invention.

Fig. 5 is a time domain diagram illustrating the behavior of the
neural network training architecture of Fig. 4.

Fig. 6 is a system diagram corresponding to the neural network
training architecture of Fig. 4.

Fig. 7 is a flow diagram illustrating the operation of the neural
network training architecture of Fig. 4 using a generic gradient descent
algorithm for computing the neural network parameter changes during
training.

Fig. 8 is a system diagram illustrating a preferred embodiment of
the system of Fig. 5.

Figs. 9a and 9b together constitute a flow diagram illustrating the
operation of the neural network training architecture of Fig. 4 for an
embodiment employing a particular type of conventional gradient descent
algorithm.

Figs. 10, 11 and 12 illustrate different simulation results of a neural
network learning a circular motion using the invention.

Fig. 13 is a graph of the error as a function of the number of learning
iterations for each of the cases illustrated in Figs. 10-12.

Figs. 14, 15 and 16 illustrate different simulation results of a neural
network learning a figure-eight motion using the invention.

Fig. 17 is a graph of the error as a function of the number of learning
iterations for each of the cases illustrated in Figs. 14-16.


DETAILED DESCRIPTION OF THE INVENTION 
Temporal Learning Framework: 

We formalize a neural network as an adaptive dynamical system 
whose temporal evolution is governed by the following set of coupled 
nonlinear differential equations: 

$$\dot{u}_n + \kappa_n u_n = g_n\Big[\gamma_n\Big(\sum_m T_{nm} u_m + I_n\Big)\Big], \qquad t > 0 \qquad (1)$$

where $u_n$ represents the output of the $n$-th neuron ($u_n(0)$ being the initial
state), and $T_{nm}$ denotes the strength of the synaptic coupling from the
$m$-th to the $n$-th neuron. The constants $\kappa_n$ characterize the decay of
neuron activities. The sigmoidal functions $g_n(\cdot)$ modulate the neural
responses, with gain given by $\gamma_n$; typically, $g_n(\gamma_n x) = \tanh(\gamma_n x)$. In order
to implement a nonlinear functional mapping from an $N_I$-dimensional
input space to an $N_O$-dimensional output space, the neural network is
topographically partitioned into three mutually exclusive regions. As
shown in Figure 1a, the partition refers to a set of input neurons $S_I$, a
set of output neurons $S_O$, and a set of "hidden" neurons $S_H$. Note that
this architecture is not formulated in terms of "layers", and that each
neuron may be connected to all others including itself.
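
The activation dynamics of Eq. (1) can be explored numerically with a simple forward-Euler scheme. The following is only a minimal sketch under stated assumptions: the function name `simulate`, the fixed step `dt`, and the choice of Euler integration are illustrative and are not taken from the specification; the sigmoid is the typical tanh form mentioned above.

```python
import math

def simulate(T, kappa, gamma, external_input, u0, dt, steps):
    """Forward-Euler integration of Eq. (1), rewritten as
    du_n/dt = -kappa_n * u_n + tanh(gamma_n * (sum_m T[n][m] * u_m + I_n(t)))."""
    n = len(u0)
    u = list(u0)
    trajectory = [list(u)]
    for k in range(steps):
        I = external_input(k * dt)          # vector of I_n at the current time
        du = []
        for i in range(n):
            net = sum(T[i][m] * u[m] for m in range(n)) + I[i]
            du.append(-kappa[i] * u[i] + math.tanh(gamma[i] * net))
        u = [u[i] + dt * du[i] for i in range(n)]
        trajectory.append(list(u))
    return trajectory
```

With a single uncoupled neuron and a constant input, the state relaxes toward tanh(gamma * I) / kappa, which gives a quick sanity check of the integration.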


Let $\bar{a}(t)$ (the overhead bar denotes a vector) be an $N$-dimensional
vector of target temporal patterns, with nonzero elements, $a_n(t)$, in the
input and output sets only. When trajectories, rather than mappings,
are considered, components in the input set may also vanish. Hence,
the time-dependent external input term in Eq. (1), i.e., $I_n(t)$, encodes
the component contribution of the target temporal pattern via the
expression

$$I_n(t) = \begin{cases} a_n(t) & \text{if } n \in S_I \\ 0 & \text{if } n \in S_H \cup S_O \end{cases} \qquad (2)$$

To proceed formally with the development of a temporal learning
algorithm, we consider an approach based upon the minimization of an
error functional, $E$, defined over the time interval $[t_0, t_f]$ by the following
expression

$$E(\bar{u}, \bar{p}) = \int_{t_0}^{t_f} \frac{1}{2} \sum_n e_n^2 \, dt = \int_{t_0}^{t_f} F \, dt \qquad (3)$$

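where the error component, $e_n(t)$, represents the difference between the
desired and actual value of the output neurons, i.e.,

$$e_n(t) = \begin{cases} a_n(t) - u_n(t) & \text{if } n \in S_O \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$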
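
For sampled trajectories, the error functional of Eq. (3) can be approximated by numerical quadrature. The sketch below assumes uniformly spaced samples and uses the trapezoidal rule; the function name `error_functional` and the list-of-lists data layout (one inner list of output-neuron values per time sample) are hypothetical choices made for illustration.

```python
def error_functional(targets, outputs, dt):
    """Approximate E = integral of 0.5 * sum_n e_n(t)^2 dt (Eq. (3))
    with the trapezoidal rule, where e_n = a_n - u_n on the output neurons."""
    F = [0.5 * sum((a - u) ** 2 for a, u in zip(a_t, u_t))
         for a_t, u_t in zip(targets, outputs)]
    return dt * (sum(F) - 0.5 * (F[0] + F[-1]))
```

A constant unit error on one output neuron over a unit time interval gives E = 0.5, as expected from the factor 1/2 in the integrand.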
In our model, the internal dynamical parameters of interest are the
strengths of the synaptic interconnections, $T_{nm}$, the characteristic decay
constants, $\kappa_n$, and the gain parameters, $\gamma_n$. They can be represented as
a vector of $M$ components [where $M = N^2 + 2N$]

$$\bar{p} = \{ T_{11}, \ldots, T_{NN}, \kappa_1, \ldots, \kappa_N, \gamma_1, \ldots, \gamma_N \} \qquad (5)$$

We will assume that the elements of $\bar{p}$ are statistically independent.
Furthermore, we will also assume that, for a specific choice of parameters
and set of initial conditions, a unique solution of Eq. (1) exists. Hence,
the state variables $\bar{u}$ are an implicit function of the parameters $\bar{p}$. In the
rest of this paper, we will denote the $\mu$-th element of the vector $\bar{p}$ by
$p_\mu$ ($\mu = 1, \ldots, M$).


Traditionally, learning algorithms are constructed by invoking Lyapunov
stability arguments, i.e., by requiring that the error functional be
monotonically decreasing during learning time, $\tau$. This translates into

$$\frac{dE}{d\tau} = \sum_{\mu=1}^{M} \frac{dE}{dp_\mu} \frac{dp_\mu}{d\tau} < 0 \qquad (6)$$


One can always choose, with $\eta > 0$,

$$\frac{dp_\mu}{d\tau} = -\eta \frac{dE}{dp_\mu} \qquad (7)$$


which implements learning in terms of an inherently local minimization
procedure. Attention should be paid to the fact that Eqs. (1) and
(7) may operate on different time scales, with parameter adaptation
occurring at a slower pace. Integrating the dynamical system, Eq. (7),
over the interval $[\tau, \tau + \Delta\tau]$, one obtains

$$p_\mu(\tau + \Delta\tau) = p_\mu(\tau) - \eta \int_{\tau}^{\tau + \Delta\tau} \frac{dE}{dp_\mu} \, d\tau \qquad (8)$$
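
In a discrete implementation, the integral in Eq. (8) is commonly approximated by holding the gradient constant over one learning-time step. The sketch below makes that assumption explicitly; the function name `gradient_descent` and the toy quadratic error are hypothetical and serve only to show that the update of Eq. (7) produces the monotonic decrease required by Eq. (6) when the step is small enough.

```python
def gradient_descent(grad_E, p, eta, d_tau, n_updates):
    """Apply Eq. (8) repeatedly, treating dE/dp as constant over each
    learning-time step: p <- p - eta * d_tau * dE/dp."""
    history = [list(p)]
    for _ in range(n_updates):
        g = grad_E(p)
        p = [p_mu - eta * d_tau * g_mu for p_mu, g_mu in zip(p, g)]
        history.append(list(p))
    return history

# Toy error E(p) = 0.5 * sum(p_mu^2), whose gradient is p itself.
E = lambda p: 0.5 * sum(x * x for x in p)
history = gradient_descent(lambda p: list(p), [1.0, -2.0],
                           eta=0.5, d_tau=1.0, n_updates=20)
```

For this toy error each update halves every parameter, so the error shrinks by a factor of four per step.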


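Equation (8) implies that, in order to update a system parameter $p_\mu$,
one must evaluate the "sensitivity" (i.e., the gradient) of $E$, Eq. (3),
with respect to $p_\mu$ in the interval $[\tau, \tau + \Delta\tau]$. Furthermore, using Eq.
(3) and observing that the time integral and derivative with respect to
$p_\mu$ commute, one can write

$$\frac{dE}{dp_\mu} = \int_{t_0}^{t_f} \frac{dF}{dp_\mu} \, dt = \int_{t_0}^{t_f} \frac{\partial F}{\partial p_\mu} \, dt + \int_{t_0}^{t_f} \frac{\partial F}{\partial \bar{u}} \cdot \frac{\partial \bar{u}}{\partial p_\mu} \, dt \qquad (9)$$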




This sensitivity expression has two parts. The first term in the Right
Hand Side (RHS) of Eq. (9) is called the "direct effect", and corresponds
to the explicit dependence of the error functional on the system
parameters. The second term in the RHS of Eq. (9) is referred to as the "indirect
effect", and corresponds to the implicit relationship between the error
functional and the system parameters via $\bar{u}$. In our learning formalism,
the error functional, as defined by Eq. (3), does not depend explicitly
on the system parameters; therefore, the "direct effect" vanishes, i.e.,

$$\frac{\partial F}{\partial p_\mu} = 0 \qquad (10a)$$

Since $F$ is known analytically (viz. Eqs. (3) and (4)), computation of
$\partial F / \partial \bar{u}$ is straightforward. Indeed

$$\frac{\partial F}{\partial u_n} = -e_n \qquad (10b)$$

Thus, to enable evaluation of the error gradient using Eq. (9), the
"indirect effect" matrix $\partial \bar{u} / \partial \bar{p}$ should, in principle, be computed.


TEACHER FORCING 

The neural activation dynamics specified by Eqs. (1) and (2) do
not include explicit information regarding the desired network output.
If these equations are used in conjunction with the learning formalism
described in the previous section, the network parameters (i.e., the
elements of $\bar{p}$) will be modified at the end of a trajectory, i.e., at time
$t_f$, as shown schematically in Figure 1b. Such a parameter adaptation is
based upon the total error between the desired and the actual output of
the network, accumulated over the interval $[t_0, t_f]$. Referring to Fig. 1b,
a neural network 2 is stimulated by a time-varying training vector I(t)
to produce a time-varying output vector u(t). A subtractor 4 subtracts
the output vector u(t) from a time-varying target vector a(t) to produce
a time-varying error vector e(t). An integrator 6 integrates the error
vector e(t) over the time period of the time-varying training vector I(t).
At the end of the time period, the result of this integration is used by a
gradient descent algorithm to change the parameters (e.g., the synapse
weights) of the neural network in such a manner as to reduce the output
of the integrator 6 in the next time period. In our earlier analogy to
real-life behavior, this would correspond to a parent staying home,
letting his child ride a bicycle and, after each trial, telling him all the errors
he made. "Conventional" supervised learning operates in this fashion,
and usually takes a great many iterations to produce the desired results.

In order to overcome this difficulty, we consider the concept of
teacher forcing, i.e., driving the output neurons to desired values in
finite time. Williams and Zipser [Williams, R.J., and Zipser, D. (1988)
"A learning algorithm for continually running fully recurrent neural
networks," Technical Report ICS Report 8805, UCSD, La Jolla, CA 92093]
disclose forcing in a similar context. Their focus, however, is on
discrete-time problems. To highlight the differences between the two approaches
we make the following observations. By definition, the conventional
output of a network at time step $(t + 1)$, without teacher forcing, is a function
of the external inputs to the network and of the network's states at time
step $(t)$, i.e., in our notation,

$$u_n(t + 1) = g_n[I_i(t), u_j(t), \bar{p}]$$

where $n \in S_O$, $i \in S_I$ and $j \in S_I \cup S_H \cup S_O$. To introduce teacher forcing,
Williams and Zipser replace the output of the network with the desired
output values at time step $(t)$. This means that

$$u_n(t + 1) = g_n[I_i(t), u_j(t), a_n(t), \bar{p}]$$

where $n \in S_O$, $i \in S_I$ and $j \in S_I \cup S_H$. The network parameters can be
updated either at the end of each time step, or at the end of the
trajectory, i.e., at time $t_f$. A schematic block diagram of this model, in which
the parameters are updated at the end of the trajectory, is given in
Figure 2. Referring to Fig. 2, at time t the neural network is "forced" to an
output vector equal to the current target vector a(t). The neural network
then responds to the current training vector I(t) to produce an output
vector u(t+1) at the next time step t+1. The subtractor 4 subtracts the
output vector u(t+1) from the target vector a(t+1) of the next time step
t+1 to produce an error vector e(t+1). Thereafter, the operation of the
model of Fig. 2 is analogous to that of Fig. 1b. The temporal behavior
of this model is illustrated in Fig. 3, in which the neuron outputs are
forced to the training target (zero-error) values at the end of each time
step. Since the network outputs, $u_n(t + 1)$, $n \in S_O$, are dependent upon
the desired values $a_n(t)$ of the network outputs at time step $t$, the
algorithm can be interpreted as training the network to capture the velocity
of given points on the trajectory, rather than the trajectory itself. In
our earlier analogy, each time interval may be viewed as a learning
session at the end of which the parent is correcting the child's performance.
The teacher forcing paradigm of the present invention, on the other
hand, stems from feedback control. In such a scheme, with continuous
network dynamics, the error between the actual and the desired outputs
is fed back as inputs to the network output set neurons. A schematic
block diagram of the invention is presented in Figure 4. Referring to Fig.
4, on a simplistic level the operation of the invention is analogous to the
model of Fig. 1b discussed above. However, the invention modifies the
error vector e(t) by a function $\lambda(t)$ and feeds the modified error vector
back to the neural network 2 in real time. Preferably, this feedback is
applied directly to the inputs of the array of output neurons of the neural
network 2. As can be seen, the parameters of the network are updated
based upon the error accumulated over the length of the trajectory, i.e.,
over the interval $[t_0, t_f]$. Again, by analogy, this scheme corresponds to
a parent accompanying his child and holding the bicycle during the
trajectory, to keep him on the right track as much as possible. At the end
of the trajectory the parent would explain to his child what went wrong
and where, so that corrective action can be taken for the next round. In
order to incorporate this teacher forcing into the neural learning
formalism presented earlier, the time-dependent input to the neural activation
dynamics, Eq. (1), i.e., $I_n(t)$ as given by Eq. (2), is modified to read:

$$I_n(t) = \begin{cases} a_n(t) & \text{if } n \in S_I \\ 0 & \text{if } n \in S_H \\ \lambda \, [a_n(t)]^{1-\beta} \, [a_n(t) - u_n(t)]^{\beta} & \text{if } n \in S_O \end{cases} \qquad (11)$$

At this stage, $\lambda$ and $\beta$ are assumed to be positive constants. The purpose
of the term $[a_n(t)]^{1-\beta}$ is to ensure that $I_n(t)$ has the same dimension as
$a_n(t)$ and $u_n(t)$. It has been demonstrated (1989) that, in general, for
$\beta = (2i + 1)/(2j + 1)$, with $i < j$ and $i$ and $j$ strictly positive integers,
an expression of the form $[a_n - u_n]^{\beta}$ induces a terminal attractor
phenomenon for the dynamics described by Eq. (1). Barhen et al. [Barhen,
J., Toomarian, N. and Gulati, S. (1990) "Adjoint operator algorithms
for faster learning in dynamical neural networks," in David S. Touretzky
(Ed.), Advances in Neural Information Processing Systems, Vol. 2, pp.
498-508, San Mateo, CA (Morgan Kaufmann); and, Barhen, J., Toomarian,
N. and Gulati, S. (1990) "Application of adjoint operators to neural
learning," Applied Mathematical Letters, Vol. 3, No. 3, pp. 13-18] have
considered terminal attractor dynamics induced from the input set, $S_I$,
rather than the output set, $S_O$. They have observed that such dynamics
enable time-independent mappings to be learned much faster than
backpropagation. This provided the motivation for choosing $\beta = 7/9$ for the
numerical simulations described below in this specification. Simulations
with other positive constants, such as $\beta = 1$, have produced qualitatively
similar results, albeit over a longer training period. A study of
the sensitivity of the results to the choice of $\beta$ is beyond the scope of
this specification.
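
The output-neuron branch of Eq. (11) involves a fractional power of a quantity that may be negative, which needs care in floating-point code. The sketch below assumes the real-valued interpretation appropriate to an odd-over-odd exponent such as 7/9: the error term keeps its sign, and the scaling term with exponent 1 - beta = 2/9 is taken as a power of the absolute value. The function names `odd_root_power` and `forcing_input` are hypothetical.

```python
import math

def odd_root_power(x, beta):
    """Real-valued x**beta for beta = (2i+1)/(2j+1); the sign of x is kept.
    (In Python, a negative float raised to a fractional power is complex.)"""
    return math.copysign(abs(x) ** beta, x)

def forcing_input(a_n, u_n, lam, beta=7.0 / 9.0):
    """Teacher-forcing input for an output neuron, following Eq. (11):
    lam * [a_n]^(1 - beta) * [a_n - u_n]^beta."""
    return lam * abs(a_n) ** (1.0 - beta) * odd_root_power(a_n - u_n, beta)
```

The feedback is positive when the output is below its target, negative when it is above, and vanishes when the two coincide, which is the corrective behavior described in the text.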

When learning is successfully completed [i.e., $e_n(t) = 0$], teacher
forcing will vanish, and the network will revert to the conventional
dynamics given by Eqs. (1) and (2). However, there might be instances
where the error functional cannot be reduced to zero, implying that
the teacher forcing term will not vanish as learning proceeds. Thus, a
discrepancy in results between the learning and recall mode of the
network should be expected. In an attempt to overcome this problem, we
recall another lesson from life. When a parent teaches his child to ride a
bicycle, at early stages he keeps his hands on the bicycle, accompanying
the child. However, as soon as the child shows some learned skills in
controlling himself, the parent will take his hands off more and more often,
to let the child ride independently. In this vein, the teacher's
intervention in the learning process preferably decreases as learning progresses.
Specifically, in Equation (11) $\lambda$ may be modulated in time as a function of
the error functional, according to

$$\lambda(\tau) = 1 - e^{-E(\tau)} \qquad (12)$$

The above expression should be understood as indicating that, while
$\lambda$ varies on the learning time scale, it remains at essentially constant
levels during the iterative passes over the interval $[t_0, t_f]$.
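
The modulation of Eq. (12) is simple to evaluate once per training cycle. In the sketch below the function name `forcing_gain` is hypothetical, and the use of `math.expm1` is an implementation choice made here to keep precision when the total error is very small; it is mathematically identical to 1 - exp(-E).

```python
import math

def forcing_gain(total_error):
    """Eq. (12): lambda = 1 - exp(-E), evaluated once per learning cycle
    and held constant during the pass over [t0, tf]."""
    return -math.expm1(-total_error)
```

The gain lies between 0 and 1, approaches 1 for large total error (strong forcing early in training) and falls to 0 as the error vanishes.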

The behavior in a time continuum of the neural network in the training
architecture of Fig. 4 is illustrated in Fig. 5, in accordance with the
temporal evolutionary behavior defined by equation (1). At a given time
t, the output of a given output neuron is u(t) while the target value for
that neuron is a(t), which differs from the actual neuron output by an
error e(t). The feedback of the error e(t), illustrated in Fig. 4, reduces
the error at the next time differential, t+dt, by an amount f[e(t)] which
is a function of the error e(t). Thus, without the invention, the neuron
output at the next time differential t+dt would have been u'(t+dt), but
with the invention the error at t+dt is reduced by f[e(t)] to produce
a neuron output u(t+dt) which is closer to the target output a(t+dt).
The overall result is that the total error $E(\tau)$ of equation (3) is reduced.
In accordance with equation (12), the amount of correction, namely the
proportion of the error e(t) fed back to the output neuron, is reduced as
the total error $E(\tau)$ of equation (3) is reduced at the end of each learning
cycle of time duration $\Delta t = [t_0, t_f]$.

A significant advantage of the invention is that it works in the time
continuum of the differential equation of Equation (1), while the
technique of Figs. 2 and 3 is a numerical simulation not realizable using
analog neurons.

Fig. 6 illustrates a simple tutorial example corresponding to the
architecture of Fig. 4, in which the error e(t), scaled by a factor of
$1 - \exp[-E(\tau)]$, is directly fed back to the inputs of the output neurons. As
shown in Fig. 6, the neural network 10 includes a set of input neurons
12, a set of hidden neurons 14 and a set of output neurons 16. The
neurons 12, 14, 16 are selectively interconnected through weighted synapses
(not shown in Fig. 6) whose weights are determined, along with the
gains of the neurons, during a preliminary training exercise. During this
exercise, a training set of time-dependent neuron inputs is applied
during a predetermined time interval to the inputs of the input neurons 12,
which produces a set of neuron outputs u(t). An error vector e(t) is
determined by a subtractor 18 subtracting the vector of neuron outputs
u(t) from the vector of target neuron outputs a(t). All elements of the
error vector e(t) are squared and summed and integrated over the
predetermined time interval by the integrator 20 to produce the total error
$E(\tau)$ of Equation (3) at the end of the current training cycle, which is
stored in a register 22. A multiplier 24 multiplies each component of the
error vector e(t) by the factor $1 - \exp[-E(\tau)]$, and the product is applied as
feedback to the input of the corresponding output neuron 16. A
conventional gradient descent algorithm 26, using the output of the integrator
20 and the current values of the neuron gains and synaptic weights of
the neural network 10, computes the desired changes to the gains and
weights at the end of the predetermined time interval, which are then
implemented in the neural network 10. The process is then repeated in
successive cycles with a cyclic period equal to the predetermined time
interval, until the total error $E(\tau)$ reaches zero.

The operation of the system of Fig. 6 is illustrated in Fig. 7.
Preliminarily, the neuron temporal behavior during the evolutionary learning
process is defined by the differential equation of Equation (1) (block 30
of Fig. 7) and a training set is defined for the inputs to the input neurons
12 and for target outputs of the output neurons 16 (block 32 of Fig. 7).
The training set neuron inputs are time dependent functions over the
predetermined time interval. Next, the training set neuron inputs are
applied to the inputs of the input neurons 12 for the predetermined time
interval $\Delta t = [t_0, t_f]$ (block 34 of Fig. 7) while the errors e(t) between
the outputs of the output neurons 16 and the desired target outputs are
monitored (block 36). The squares of the errors are summed and
integrated over the predetermined time period (block 38) to produce the
total error $E(\tau)$ for the current learning cycle. The gradient descent
algorithm is then performed (block 40) to compute the changes to each of
the neural network parameters (e.g., neural gains and synaptic weights),
and these changes are then added to the corresponding neural network
parameters (block 42). If the total error $E(\tau)$ of the current learning
cycle is zero (or below some predetermined threshold), then the training
session is finished (YES branch of block 44). Otherwise (NO branch of
block 44), the system proceeds to the next learning cycle (block 46) and
the process is repeated starting at block 34 of Fig. 7.
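
The cycle of Fig. 7 can be sketched end to end on a deliberately tiny example. Everything below is an illustrative assumption rather than the disclosed implementation: a single self-coupled neuron that is also the output neuron, a constant target of 0.5, fixed decay and gain, forward-Euler integration, and a central finite-difference gradient standing in for the gradient descent algorithm of block 40. The names `target`, `run_cycle` and `train` are hypothetical.

```python
import math

BETA = 7.0 / 9.0           # exponent used in the simulations of the text
KAPPA, GAMMA = 1.0, 1.0    # fixed decay and gain (toy assumption)
DT, STEPS = 0.01, 500      # Euler discretization of [t0, tf] = [0, 5]

def target(t):
    return 0.5             # hypothetical constant target trajectory

def run_cycle(T, lam):
    """One pass over [t0, tf] (blocks 34-38); returns the total error E."""
    u, E = target(0.0), 0.0
    for k in range(STEPS):
        a = target(k * DT)
        e = a - u
        E += 0.5 * e * e * DT
        # Teacher-forcing input of Eq. (11) for the output neuron.
        forcing = lam * abs(a) ** (1.0 - BETA) * math.copysign(abs(e) ** BETA, e)
        u += DT * (-KAPPA * u + math.tanh(GAMMA * (T * u + forcing)))
    return E

def train(T=0.0, eta=0.3, cycles=60, h=1e-4):
    lam = 1.0                                  # full forcing at the start
    for _ in range(cycles):
        E = run_cycle(T, lam)
        # Finite-difference stand-in for the gradient computation (block 40).
        grad = (run_cycle(T + h, lam) - run_cycle(T - h, lam)) / (2.0 * h)
        T -= eta * grad                        # parameter update (block 42)
        lam = 1.0 - math.exp(-E)               # Eq. (12), held for next cycle
        if E < 1e-6:                           # stopping test (block 44)
            break
    return T
```

Calling `run_cycle(T, 0.0)` evaluates the same network in recall mode, that is, without any teacher forcing, so the trained weight can be compared against the untrained one on equal terms.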

The preferred embodiment of the invention is illustrated in Fig. 8.
In Fig. 8, the error e(t) is scaled before being applied as feedback to the
neural network 10. First, the error e(t) is raised to a selected exponential
power $\beta$ by a processor 50, while the target output a(t) is raised to a
complementary exponential power $1 - \beta$ by a processor 52. The results
are combined by a multiplier 54 and the product is input to the multiplier
24. The gradient descent algorithm 26 transmits neural gain adjustments
to the neurons 12, 14, 16 and transmits synaptic weight adjustments to
the synapses 53 in order to adjust the neuron gains and synapse weights
at the end of each time interval. The gradient descent algorithm
computes these adjustments based upon the output of the integrator 20 in a
well-known manner. The skilled worker may devise various alternative
techniques for scaling e(t) depending upon the specific application of the
invention.

GRADIENT DESCENT ALGORITHMS 

The efficient computation of system response sensitivities (e.g., error 
functional gradients) with respect to all parameters of a network’s archi- 
tecture plays a critically important role in neural learning. As mentioned 
previously herein, the gradient descent algorithm 26 may be any suitable 
gradient descent algorithm of the prior art. The following describes how 
one of the best gradient descent algorithms is employed in the invention. 


Direct Approach Gradient Descent Algorithm 



Let us differentiate the activation dynamics, Eq. (1), including the
teacher forcing, Eq. (11), with respect to $p_\mu$. We observe that the time
derivative and partial derivative with respect to $p_\mu$ commute. Using the
shorthand notation $\partial(\cdots)/\partial p_\mu = (\cdots)_{,\mu}$ we obtain a set of equations
referred to as "Forward Sensitivity Equations" (FSEs):
In the above expressions, $g'_n$ represents the derivative of $g_n$ with respect
to its arguments, $\delta$ denotes the Kronecker symbol, and $S_{n,\mu}$ is defined
as a nonhomogeneous “source”. The source term contains all explicit
derivatives of the neural activation dynamics, Eq. (1), with respect to
the system parameters, $p_\mu$. Hence, it is parameter dependent and its size
is ($N \times M$). The initial conditions of the activation dynamics, Eq. (1),
are excluded from the vector of system parameters $p$. Thus, the initial
conditions of the FSEs will be taken as zero. Their solution will provide
the matrix $\partial u/\partial p$ needed for computing the “indirect effect”
contribution to the sensitivity of the error functional, as specified by Eq. (9).
This gradient descent algorithm is essentially similar to the scheme
proposed in the above-referenced publication by Williams and Zipser (1989).
Computation of the gradients using the forward sensitivity formalism
requires solving Eq. (13) $M$ times, since the source term, $S_{n,\mu}$, explicitly
depends on $p_\mu$. This system has $N$ equations, each of which requires
multiplication and summation over $N$ neurons. Hence, the computational
complexity, measured in terms of multiply-accumulates, scales like
$N^2$ per system parameter, per time step. Let us assume, furthermore,
that the interval $[t_0, t_f]$ is discretized into $L$ time steps. Then, since the
number of parameters $M$ is itself of order $N^2$, the total number of
multiply-accumulate operations scales like $N^4 L$. Clearly,
such a scheme exhibits expensive scaling properties, and would not be
very practical for large networks. On the other hand, since the FSEs
are solved forward in time, along with the neural dynamics, the method
also has inherent advantages. In particular, there is no need for a large
amount of memory. Since $u_{n,\mu}$ has $N^3 + 2N^2$ components, the storage
requirement scales as $O(N^3)$.
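
The scaling estimates above can be checked with a few lines of arithmetic. The sketch below assumes a parameter count of M = N^2 + 2N, which is inferred from the stated N^3 + 2N^2 sensitivity components (N values per parameter) rather than given explicitly; the function name `fse_cost` is hypothetical.

```python
def fse_cost(n_neurons, n_steps):
    """Estimate forward-sensitivity cost for a fully connected network.

    Returns (n_params, multiply_accumulates, stored_sensitivities).
    Assumes M = N**2 + 2*N parameters, inferred from the stated
    N**3 + 2*N**2 sensitivity components.
    """
    n_params = n_neurons ** 2 + 2 * n_neurons
    # N equations, each summing over N neurons, per parameter, per step.
    macs = n_neurons ** 2 * n_params * n_steps
    # One sensitivity value per neuron and per parameter.
    storage = n_neurons * n_params
    return n_params, macs, storage


# Example: N = 6 neurons integrated over L = 63 time steps.
n_params, macs, storage = fse_cost(6, 63)
```

The leading term of the operation count is N^4 L, while the storage does not depend on L because the sensitivities are advanced forward in time together with the neural dynamics.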

If the foregoing is employed for the gradient descent algorithm 26 
of Fig. 6, then the step of performing a gradient descent algorithm of 
Fig. 7 (block 40 of Fig. 7) may be broken into steps 40a, 40b and 40c as 
illustrated in Fig. 9b. Specifically, the first step (block 40a of Fig. 9b) 
of the gradient descent algorithm 26 is to derive the forward sensitivity 
equations (Equations 13-15) from the neural learning behavior (Equa- 
tion 1). The next step is to solve the forward sensitivity equations once 
for each of the M neural network parameters (block 40b of Fig. 9b). The 
third step (block 40c of Fig. 9b) is to compute the partial derivative of 
each neuron output u(t) with respect to each of the M network parame- 
ters. Finally, the computation step of block 42’ of Fig. 9b employs the 
integral of this derivative to compute the change to the corresponding 
network parameter at the end of the current learning cycle. 
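
Steps 40b and 40c can be illustrated with a small forward sensitivity integration. Because Eq. (1) and Eqs. (13)-(15) are not reproduced in this passage, the sketch below assumes a stand-in additive model, du_k/dt = -kappa*u_k + tanh(gamma * sum_m T[k][m]*u[m]), omits the teacher forcing, and treats only the synaptic weights as parameters. The function name and all argument names are hypothetical; it shows the technique, not the exact equations of the invention.

```python
import math


def forward_sensitivities(T, u0, dt=0.1, steps=63, kappa=1.0, gamma=1.0):
    """Integrate assumed dynamics and their weight sensitivities together.

    Returns (u, S) where S[k][i][j] approximates d u_k / d T[i][j]
    at the final time, using first-order (Euler) steps.
    """
    n = len(u0)
    u = list(u0)
    # Sensitivities start at zero: initial conditions are not parameters.
    S = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for _ in range(steps):
        x = [gamma * sum(T[k][m] * u[m] for m in range(n)) for k in range(n)]
        g = [math.tanh(v) for v in x]
        gp = [1.0 - v * v for v in g]  # derivative of tanh
        new_S = [[[0.0] * n for _ in range(n)] for _ in range(n)]
        for k in range(n):
            for i in range(n):
                for j in range(n):
                    # Indirect term: dependence through the other neurons.
                    indirect = sum(T[k][m] * S[m][i][j] for m in range(n))
                    # Source term: explicit dependence on T[i][j].
                    source = u[j] if k == i else 0.0
                    rate = (-kappa * S[k][i][j]
                            + gamma * gp[k] * (indirect + source))
                    new_S[k][i][j] = S[k][i][j] + dt * rate
        u = [u[k] + dt * (-kappa * u[k] + g[k]) for k in range(n)]
        S = new_S
    return u, S


# Example with three neurons and arbitrary illustrative weights.
T = [[0.2, -0.5, 0.3], [0.4, 0.1, -0.6], [-0.3, 0.7, 0.05]]
u0 = [0.1, -0.2, 0.3]
u_final, S_final = forward_sensitivities(T, u0)
```

The three index loops plus the inner sum make the N^4 cost per time step visible, and only the current sensitivities are kept in memory. The result can be checked against a finite-difference perturbation of a single weight.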

NUMERICAL SIMULATIONS 

The embodiment of Fig. 8 has been applied to the problem of learn- 
ing two trajectories: a circle and a figure eight in computer simulations. 
Results of applying prior art techniques to these problems can be found 
in the literature, and they offer sufficient complexity for illustrating the 
computational efficiency of our proposed formalism. 

In the following computer simulations, the network that was trained 
to produce these trajectories using the present invention involved 6 fully 
connected neurons, with no input, 4 hidden and 2 output units. An 
additional “bias” neuron was also included. In these simulations, the 
dynamical systems were integrated using a first order finite difference 
approximation. The neuron sigmoidal nonlinearity was modeled by a 
hyperbolic tangent. Throughout, the decay constants $\kappa_n$, the neural
gains $\gamma_n$, and $\lambda$ were set to one. Furthermore, $\beta$ was selected to be 7/9.



For the learning dynamics, $\Delta\tau$ was set to 6.3 and $\eta$ to 0.015873. The
two output units were required to oscillate according to

$$a_5(t) = A \sin \omega t \tag{16a}$$

$$a_6(t) = A \cos \omega t \tag{16b}$$

for the circular trajectory, and according to

$$a_5(t) = A \sin \omega t \tag{17a}$$

$$a_6(t) = A \sin 2\omega t \tag{17b}$$

for the figure eight trajectory. Furthermore, we took $A = 0.5$ and $\omega = 1$.
Initial conditions were defined at $t_0 = 0$. Plotting $a_5$ versus $a_6$ produces
the “desired” trajectory. Since the period of the above oscillations is
$2\pi$, $t_f = 2\pi$ time units are needed to cover one cycle. We selected
$\Delta t = 0.1$, to cover one cycle in approximately 63 time steps.
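
The two target trajectories can be sampled as in the following sketch, which uses the amplitude of 0.5, unit angular frequency and time step of 0.1 stated above. The function name `target_trajectory` and its argument names are hypothetical.

```python
import math


def target_trajectory(kind, amplitude=0.5, omega=1.0, dt=0.1):
    """Sample one cycle of the target outputs (a5, a6).

    kind is "circle" (Eqs. 16a, 16b) or "figure_eight" (Eqs. 17a, 17b).
    """
    steps = round(2.0 * math.pi / (omega * dt))  # about 63 for the defaults
    points = []
    for k in range(steps + 1):
        t = k * dt
        a5 = amplitude * math.sin(omega * t)
        if kind == "circle":
            a6 = amplitude * math.cos(omega * t)
        elif kind == "figure_eight":
            a6 = amplitude * math.sin(2.0 * omega * t)
        else:
            raise ValueError("kind must be 'circle' or 'figure_eight'")
        points.append((a5, a6))
    return points


circle = target_trajectory("circle")
figure_eight = target_trajectory("figure_eight")
```

The circle starts at (0, 0.5) and the figure eight starts at the origin; plotting the first component against the second gives the desired curve.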


Circular Trajectory 

In order to determine the capability and effectiveness of the algorithm,
three cases were examined. As initial conditions, the values of $u_n$
were assumed to be uniform random numbers between -0.01 and 0.01 for
the simulation studies referred to in the sequel as “Case - 1” and
“Case - 2”. For Case - 3, we set $u_n$ equal to zero, except $u_6$, which was
set to 0.5. The synaptic interconnections were initialized to uniform
random values between -0.1 and +0.1 for all three experiments.

CASE - 1.

The training was performed over $t_f = 6.5$ time units (i.e., 65 time
intervals). A maximum number of 500 iterations was allowed. The results
shown in Fig. 10 were obtained by starting the network with the
same initial conditions, $u_n(0)$, as used for training, the learned values of
the synaptic interconnections, $T_{nm}$, and with no teacher forcing ($\lambda = 0$).
As we can see, it takes about 2 cycles until the network reaches a consistent
trajectory. Despite the fact that the system’s output was plotted for
more than 15 cycles, only the first 2 cycles can be distinguished. Figure
13 demonstrates that most of the learning occurred during the first 300
iterations.

CASE - 2. 

Here, we decided to increase the length of the trajectory gradually.
A maximum number of 800 learning iterations was now allowed. The
length of the training trajectory was 65 time intervals for the first 100
iterations, and increased every 100 iterations by 10 time intervals.
Therefore, it was expected that the error functional would increase whenever
the length of the trajectory was increased. This was indeed observed, as
may be seen from the learning graph, shown in Fig. 13. The output of
the trained network is illustrated in Fig. 11. Here again, from 15 recall
cycles, only the first two (needed to reach the steady orbit) are
distinguishable and the rest overlap. Training using greater trajectory lengths
yielded a recall circle much closer to the desired one than in the previous
case. From Fig. 13, one can see that the last 500 iterations did not
dramatically enhance the performance of the network. Thus, for practical
purposes, one may stop the training after the first 300 iterations.
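
The gradual lengthening of the training trajectory can be written as a simple schedule. The sketch below assumes zero-based iteration counting; the function name `trajectory_length` and its argument names are hypothetical.

```python
def trajectory_length(iteration, base=65, increment=10, block=100):
    """Number of time intervals trained on at a given iteration.

    The length starts at `base` and grows by `increment` after every
    `block` iterations (iterations are counted from zero here).
    """
    return base + increment * (iteration // block)


# Lengths at the start of each 100-iteration block over 800 iterations.
lengths = [trajectory_length(i) for i in range(0, 800, 100)]
```

Under this reading the last block of an 800-iteration run is trained on 135 time intervals; a different `increment` can be passed for schedules that grow more slowly.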

CASE - 3. 

The selection of appropriate initial conditions for $u_n$ plays an important
role in the effectiveness of the learning. Here, all initial values of
$u_n$ were selected to be exactly zero except the last unit, where $u_6 = 0.5$
was chosen. This corresponds to an initial point on the circle. The length
of the trajectory was increased successively, as in the previous case. In
spite of the fact that we allowed the system to perform up to 800 iterations,
the learning was essentially completed in about 200 iterations, as
shown in Fig. 13. The results of the network’s recall are presented in
Fig. 12, which shows an excellent match.

Figure Eight Trajectory 

For this problem, the synaptic interconnections were initialized to
uniform random values between -1 and +1. As initial conditions, the
values of $u_n$ were assumed to be uniform random numbers between -0.01
and 0.01. The following three situations were examined.

CASE - 4. 
The training was performed over $t_f = 6.5$ time units (i.e., 65 time
intervals). A maximum number of 1000 iterations was allowed. The results
shown in Fig. 14 were obtained by starting the network with the
same initial conditions, $u_n(0)$, as used for training, the learned values of
the synaptic interconnections, $T_{nm}$, and with no teacher forcing ($\lambda = 0$).
As we can see, it takes about 3 cycles until the network reaches a
consistent trajectory. Despite the fact that the system’s output was plotted
for more than 15 cycles, only the first 3 cycles can be distinguished.

CASE - 5. 

Here, we again decided to increase the length of the trajectory
gradually. A maximum number of 1000 iterations was now allowed. The
length of the training trajectory was 65 time intervals for the first 100
iterations, and was increased every 100 iterations by 5 time intervals.
Therefore, it was again expected that the objective functional would
increase whenever the length of the trajectory was increased. This was
indeed observed, as may be seen from the learning graph, shown in Fig.
17. The output of the trained network is illustrated in Fig. 15. Here
again, from 15 recall cycles, only the first three (needed to reach the
steady orbit) are distinguishable, and the rest overlap. As a direct result
of training using greater trajectory lengths, orbits much closer to the
desired one than in the previous case were obtained.

CASE - 6. 

The learning in this case was performed under conditions similar to
CASE - 5, but with the distinction that $\lambda$ was modulated according to
Eq. (12). The results of the network's recall are presented in Fig. 16,
and demonstrate a dramatic improvement with respect to the previous
two cases.

It is important to keep in mind the following observations with re- 
gard to the foregoing simulation results: 

1) For the circular trajectory, $\lambda$ was kept constant throughout the
simulations and not modulated according to Eq. (12). As we can see
from Fig. 13, in cases 1 and 2 the error functional was not reduced to
zero. Hence, a discrepancy in the functional form of the neural activation
dynamics used during the learning and recall stages occurred. This was
a probable cause for the poor performance of the network. In case 3,
however, the error functional was reduced to zero. Therefore, the teacher
forcing effect vanished by the end of the learning.

2) For the figure eight trajectory, the difference between cases 5 and
6 lies in the modulation of $\lambda$ (i.e., the amplitude of the teacher forcing).
Even though in both cases the error functional was reduced to a
negligible level, the effect of the teacher forcing in case 5 was not completely
eliminated over the entire length of the trajectory. This indicates
that modulation of $\lambda$ not only reduces the number of iterations
but also provides higher quality results.


In order to assess the effectiveness of the new method, the foregoing
simulations applied it to two examples of representative complexity
which have recently been analyzed in the open literature. We have
demonstrated that a circular trajectory can be learned in approximately
200 iterations compared to the 12000 reported by Pearlmutter (1989).
A figure eight trajectory was achieved in under 500 iterations compared
to 20000 previously required. Most important, however, is the quality of
the obtained results. The trajectories computed using our new method
are much closer to the target trajectories than was reported in previous
studies.

While the invention has been described in accordance with the
preferred embodiment in which the feedback is reduced as a function of
the error E(t) over successive learning cycles, it may be that in some
instances such a decrease will not be steady and may not even occur in
individual cycles. Moreover, other schemes to modulate the feedback in
accordance with the invention may be employed. For example, in those
cases where a steady decrease in the error E(t) over successive cycles
may be generally expected, the feedback could be modulated as a
function of the number of cycles independently or dependently of the error
E(t). Moreover, while the invention has been described with reference
to a gradient descent algorithm used to adjust both the synapse weights
and the neuron gains, any subset or all of these neural network
parameters or other neural network parameters may be adjusted. For example,
it may be that only the synapse weights would be adjusted at the end of
each repetitive cycle.

While the invention has been described in connection with training
a neural network with time-varying training vectors and target vectors,
the invention may also be applied in training neural networks with
time-invariant training vectors and target vectors.


While the invention has been described in detail by specific reference
to preferred embodiments of the invention, it is understood that
variations and modifications thereof may be made without departing from the
true spirit and scope of the invention.





FAST TEMPORAL NEURAL LEARNING USING 
TEACHER FORCING 

ABSTRACT OF THE INVENTION 
A neural network is trained to output a time dependent target vec- 
tor defined over a predetermined time interval in response to a time 
dependent input vector defined over the same time interval by apply- 
ing corresponding elements of the error vector, or difference between the 
target vector and the actual neuron output vector, to the inputs of cor- 
responding output neurons of the network as corrective feedback. This 
feedback decreases the error and quickens the learning process, so that 
a much smaller number of training cycles are required to complete the 
learning process. A conventional gradient descent algorithm is employed 
to update the neural network parameters at the end of the predetermined 
time interval. The foregoing process is repeated in repetitive cycles until 
the actual output vector corresponds to the target vector. In the pre- 
ferred embodiment, as the overall error of the neural network output 
decreases during successive training cycles, the portion of the error fed 
back to the output neurons is decreased accordingly, allowing the net- 
work to learn with greater freedom from teacher forcing as the network 
parameters converge to their optimum values. The invention may also 
be used to train a neural network with stationary training and target 
vectors. 